N-th Order Ergodic Multigram HMM for Modeling of Languages without Marked Word Boundaries

Authors

  • Hubert Hin-Cheung Law
  • Chorkin Chan
Abstract

Ergodic HMMs have been successfully used for modeling sentence production. However, for some oriental languages such as Chinese, a word can consist of multiple characters without word boundary markers between adjacent words in a sentence. This makes word segmentation of the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the N-th order Ergodic Multigram HMM for language modeling of such languages. Each state of the HMM can generate a variable number of characters corresponding to one word. The model can be trained without a word-segmented and tagged corpus, and both segmentation and tagging are trained in a single model. Results of its application to a Chinese corpus are reported.
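The idea that each state emits a variable number of characters can be illustrated with a small decoding sketch. The code below is not the authors' implementation: it assumes a first-order model instead of the N-th order model of the paper, covers only Viterbi-style decoding (not training), and uses a hypothetical function name (`viterbi_multigram`) and made-up toy probabilities over Latin letters. It searches over all ways to cut the character string into words while choosing a state (word class) for each word, so segmentation and tagging fall out of the same search.

```python
import math

def viterbi_multigram(sentence, states, init, trans, emit, max_len=4):
    """Jointly segment and tag `sentence` (first-order simplification).

    init[s]     : P(first state = s)
    trans[p][s] : P(next state = s | current state = p)
    emit[s][w]  : P(word w | state s); w may span several characters
    Returns (log_prob, [(word, state), ...]), or (-inf, []) if no parse exists.
    """
    n = len(sentence)
    # best[t][s] = (log prob, start index of last word, previous state)
    # for the best parse of sentence[:t] whose last word was emitted by s.
    best = [dict() for _ in range(n + 1)]
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):  # length of the last word
            start = t - k
            word = sentence[start:t]
            for s in states:
                p_emit = emit.get(s, {}).get(word, 0.0)
                if p_emit == 0.0:
                    continue
                if start == 0:
                    p_init = init.get(s, 0.0)
                    candidates = [(math.log(p_init), None)] if p_init > 0.0 else []
                else:
                    candidates = [
                        (lp + math.log(trans[p][s]), p)
                        for p, (lp, _, _) in best[start].items()
                        if trans.get(p, {}).get(s, 0.0) > 0.0
                    ]
                for lp, prev in candidates:
                    score = lp + math.log(p_emit)
                    if s not in best[t] or score > best[t][s][0]:
                        best[t][s] = (score, start, prev)
    if not best[n]:
        return float("-inf"), []
    # Trace back from the best final state to recover words and tags.
    state = max(best[n], key=lambda s: best[n][s][0])
    log_prob = best[n][state][0]
    path, t = [], n
    while t > 0:
        _, start, prev = best[t][state]
        path.append((sentence[start:t], state))
        t, state = start, prev
    path.reverse()
    return log_prob, path

# Toy, made-up parameters: two word classes over a three-letter alphabet.
STATES = ["A", "B"]
INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
EMIT = {"A": {"ab": 0.6, "a": 0.4}, "B": {"c": 0.5, "bc": 0.5}}

log_prob, parse = viterbi_multigram("abc", STATES, INIT, TRANS, EMIT)
print(parse)
```

With these toy numbers the string "abc" has two possible parses, "ab" + "c" and "a" + "bc", and the search keeps the more probable one. The `max_len` bound on word length is a design choice of this sketch that keeps the search linear in sentence length.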

Similar resources

Ergodic multigram HMM integrating word segmentation and class tagging for Chinese language modeling

A novel Ergodic Multigram Hidden Markov Model (HMM) is introduced which models sentence production as a doubly stochastic process, in which word classes are first produced according to a first order Markov model, and then single or multi-character words are generated independently based on the word classes, without word boundary marked on the sentence. This model can be applied to languages wit...

Stochastic pronunciation modeling by ergodic-HMM of acoustic sub-word units

We propose a stochastic pronunciation model using an ergodic hidden Markov model (EHMM) of automatically derived acoustic sub-word units (SWU). The proposed EHMM discovers the pronunciation structure inherent in the acoustic training data of a word without any a priori phonetic transcriptions. The EHMM is an HMM of HMMs – its states are SWU HMMs and the state-transitions compose various possible...

Language identification using parallel sub-word recognition - an ergodic HMM equivalence

Recently, we have proposed a parallel sub-word recognition (PSWR) system for language identification (LID) in a framework similar to the parallel phone recognition (PPR) approach in the literature, but without requiring phonetic labeling of the speech data in any of the languages in the LID task. In this paper, we show the theoretical equivalence of PSWR and ergodic HMM (E-HMM) based LID. Here, ...

Automatic Segmentation of Continuous Speech on Word and Phrase Level based on Suprasegmental Features

This article investigates whether it is possible to segment continuous speech on word and phrasal level by examination of suprasegmental parameters, in case of bound stress languages like Hungarian and Finnish. The final aim is to increase the robustness of speech recognition on language modelling level by the detection of word and phrase boundaries and so we can significantly decrease the sear...

Single speaker segmentation and inventory selection using dynamic time warping self organization and joint multigram mapping

In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self organisation techniques can be applied to build an inventory based on collected acoustic data together with the constraints of a synthesis lexicon. In this paper we will describe a prototype inventor...

Publication date: 1996